- What is text clustering?
- What are the applications?
- How to cluster text data?
Basic properties
Positive separation \[D(x,y)>0, \forall x \neq y\] \[D(x,y)=0 \ \text{iff} \ x=y\]
Symmetry \[D(x,y)=D(y,x)\]
Triangle inequality \[D(x,y) \leq D(x,z)+D(z,y)\]
Minkowski metric
\(d(x,y) = \sqrt[p]{\sum^V_{i=1}{|x_i-y_i|^p}}\)
Cosine metric
\(d(x,y)=1-\mathrm{cosine}(x,y)\)
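As a sketch (pure Python, assuming dense term-count vectors; function names are illustrative), the two metrics above can be computed as:

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance: p=1 gives Manhattan, p=2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_distance(x, y):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)
```

Note that cosine distance ignores vector length, so `[1, 1]` and `[2, 2]` are at distance 0, while Minkowski distances grow with the raw count difference.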
Edit distance
Count the minimum number of operations required to transform one string into the other
← Can be efficiently solved by dynamic programming
Extend to distance between sentences
Word similarity as cost of replacement
“terrible” -> “bad”: low cost → Lexicon or distributional semantics
“terrible” -> “terrific”: high cost → Lexicon or distributional semantics
Preserving word order in distance computation
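The dynamic program extends naturally from characters to tokens. A sketch of a word-level variant (pure Python; the `sub_cost` hook and cost values are illustrative, standing in for a lexicon or distributional-semantics score):

```python
def word_edit_distance(s, t, sub_cost=lambda a, b: 0 if a == b else 1,
                       ins_cost=1, del_cost=1):
    """Word-level edit distance via dynamic programming.

    s, t: lists of tokens.  sub_cost can encode word similarity,
    e.g. a low cost for replacing "terrible" with "bad".
    """
    m, n = len(s), len(t)
    # dp[i][j] = cost of transforming s[:i] into t[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * del_cost
    for j in range(1, n + 1):
        dp[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + del_cost,          # delete s[i-1]
                           dp[i][j - 1] + ins_cost,          # insert t[j-1]
                           dp[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return dp[m][n]
```

With the default unit costs and character lists, this reduces to the classic string edit distance (e.g. "kitten" to "sitting" costs 3).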
Partitioning method: Construct a partition of \(n\) documents into a set of \(K\) clusters
Given: a set of documents and the number \(K\)
Find: a partition of \(K\) clusters that optimizes the chosen partitioning criterion
Typical partitional clustering algorithms
k-means clustering
Assumes documents are real-valued vectors.
Clusters based on centroids (aka the center of gravity or mean) of points in a cluster, \(c\):
\[\vec \mu(c)=\frac{1}{|c|}\sum_{\vec x \in c}{\vec x}\]
Reassignment of instances to clusters is based on distance to the current cluster centroids.
Select \(K\) random docs \(\{s_1, s_2,… s_K\}\) as seeds.
Until clustering converges (or other stopping criterion):
For each doc \(d_i\): assign \(d_i\) to the cluster \(c_j\) with the closest seed \(s_j\).
(Next, update the seeds to the centroid of each cluster)
For each cluster \(c_j\): \(s_j = \vec\mu(c_j)\)
Several possibilities, e.g.,
A fixed number of iterations.
Doc partition unchanged.
Centroid positions don’t change.
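The loop above can be sketched in pure Python (Euclidean distance, with "doc partition unchanged" as the stopping criterion; the function and variable names are illustrative):

```python
import math
import random

def kmeans(docs, k, iters=100, seed=0):
    """Lloyd-style K-means on dense vectors; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]  # K random docs as seeds
    assign = None
    for _ in range(iters):
        # Reassignment step: each doc goes to its nearest current centroid
        new_assign = [min(range(k), key=lambda j: math.dist(d, centroids[j]))
                      for d in docs]
        if new_assign == assign:
            break  # doc partition unchanged -> converged
        assign = new_assign
        # Update step: move each seed to the centroid of its cluster
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assign
```

On well-separated data this converges in a handful of iterations, but, as noted below, the result still depends on the random seed selection.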
Why should the K-means algorithm ever reach a fixed point?
K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
EM is known to converge.
Number of iterations could be large.
Results can vary based on random seed selection.
Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.
Select good seeds using a heuristic (e.g., doc least similar to any existing mean)
Try out multiple starting points
Initialize with the results of another method.
Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve speed of convergence of K-means
Assumes clusters are spherical in vector space
Disjoint and exhaustive
Doesn’t have a notion of “outliers” by default
But can add outlier filtering
Typical hierarchical clustering algorithms
Bottom-up agglomerative clustering
Top-down divisive clustering
Start with all data as one cluster
Repeatedly split the remaining clusters into two
Starts with each doc in a separate cluster
The history of merging forms a binary tree or hierarchy.
Many variants to defining closest pair of clusters
\[sim(c_i, c_j) = \max_{x \in c_i, y\in c_j}{sim(x,y)}\]
Can result in “straggly” (long and thin) clusters due to chaining effect.
After merging \(c_i\) and \(c_j\), the similarity of the resulting cluster to another cluster, \(c_k\), is:
\[sim((c_i \cup c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))\]
\[sim(c_i, c_j) = \min_{x \in c_i, y\in c_j}{sim(x,y)}\]
Makes “tighter,” spherical clusters that are typically preferable.
After merging \(c_i\) and \(c_j\), the similarity of the resulting cluster to another cluster, \(c_k\), is:
\[sim((c_i \cup c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))\]
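A minimal sketch of bottom-up agglomerative clustering with a pluggable linkage (pure Python; stated in terms of Euclidean *distance*, so single-link merges the pair with the smallest closest-point distance and complete-link uses the farthest-point distance):

```python
import math

def agglomerative(points, k, linkage="single"):
    """Start with singleton clusters; repeatedly merge the closest pair
    of clusters until only k clusters remain."""
    clusters = [[p] for p in points]

    def cluster_dist(a, b):
        dists = [math.dist(x, y) for x in a for y in b]
        return min(dists) if linkage == "single" else max(dists)

    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen linkage
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the sequence of merges, rather than stopping at \(k\) clusters, would yield the binary tree (dendrogram) mentioned above.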
\[D \approx UV^T\]
Maximize similarity between entries of \(D\) and \(UV^T\)
subject to:
Constraints on \(U\) and \(V\)
lexicon: lion, tiger, cheetah, jaguar, porsche, ferrari
\[ D = \begin{pmatrix} & \text{lion} & \text{tiger} & \text{cheetah} & \text{jaguar} & \text{porsche} & \text{ferrari} \\ \text{Document-1} & 2 & 2 & 1 & 2 & 0 & 0 \\ \text{Document-2} & 2 & 3 & 3 & 3 & 0 & 0 \\ \text{Document-3} & 1 & 1 & 1 & 1 & 0 & 0 \\ \text{Document-4} & 2 & 2 & 2 & 3 & 1 & 1 \\ \text{Document-5} & 0 & 0 & 0 & 1 & 1 & 1 \\ \text{Document-6} & 0 & 0 & 0 & 2 & 1 & 2 \end{pmatrix} \]
\[ \begin{align} D &\approx Q\Sigma P^T \\ &\approx \begin{pmatrix} -0.41 & 0.17 \\ -0.65 & 0.31 \\ -0.23 & 0.13 \\ -0.56 & -0.20 \\ -0.10 & -0.46 \\ -0.19 & -0.78 \end{pmatrix} \begin{pmatrix} 8.4 & 0 \\ 0 & 3.3 \end{pmatrix} \begin{pmatrix} -0.41 & -0.49 & -0.44 & -0.61 & -0.10 & -0.12 \\ 0.21 & 0.31 & 0.26 & -0.37 & -0.44 & -0.68 \end{pmatrix} \\ &= \begin{pmatrix} 1.55 & 1.87 & 1.67 & 1.91 & 0.10 & 0.04 \\ 2.46 & 2.98 & 2.66 & 2.95 & 0.10 & -0.03 \\ 0.89 & 1.08 & 0.96 & 1.04 & 0.01 & -0.04 \\ 1.81 & 2.11 & 1.91 & 3.14 & 0.77 & 1.03 \\ 0.02 & -0.05 & -0.02 & 1.06 & 0.74 & 1.11 \\ 0.10 & -0.02 & 0.04 & 1.89 & 1.28 & 1.92 \end{pmatrix} \end{align} \]
The orthogonal basis representation of SVD is useful for folding in the reduced representation of new documents not included in the data matrix \(D\). For example, if \(\vec X\) is a row vector of a new document, then its reduced representation is given by the k-dimensional vector \(\vec XV\). This type of out-of-sample embedding is harder (albeit possible) with other forms of matrix factorization.
The SVD solution provides the same error as the unconstrained matrix factorization problem. Since most other forms of dimensionality reduction are constrained matrix factorization problems, one can typically achieve a lower residual error with SVD at the same value of the rank \(k\).
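A sketch of the rank-2 decomposition above using NumPy's `np.linalg.svd` (the matrix is the document-term example from this section; variable names mirror the \(Q\Sigma P^T\) notation):

```python
import numpy as np

# The 6x6 document-term matrix from the cats/cars example
D = np.array([[2, 2, 1, 2, 0, 0],
              [2, 3, 3, 3, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [2, 2, 2, 3, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 2, 1, 2]], dtype=float)

# Full SVD, then keep only the top-k singular values/vectors
Q, sigma, PT = np.linalg.svd(D, full_matrices=False)
k = 2
D_k = Q[:, :k] @ np.diag(sigma[:k]) @ PT[:k, :]   # rank-k approximation

# Folding in a new document: project its row vector onto the k basis vectors
X_new = np.array([1, 1, 1, 1, 0, 0], dtype=float)
reduced = X_new @ PT[:k, :].T                      # k-dimensional embedding
```

`D_k` reproduces the approximate reconstruction shown above (up to sign conventions of the singular vectors), and `reduced` is the out-of-sample embedding \(\vec XV\).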
lexicon: lion, tiger, cheetah, jaguar, porsche, ferrari
Cats: lion, tiger, cheetah, jaguar
Cars: jaguar, porsche, ferrari
\[ \begin{align} \text{Maximize}_{(P,Q,\Sigma)}&[\text{Log likelihood of generating }D \text{ using parameters in matrices }(P,Q,\Sigma)] \\ &= \log\left(\prod_{i,j}{P(\text{adding one occurrence of term }j \text{ in document }i)^{D_{ij}}}\right) \\ &= \sum_{i=1}^n\sum_{j=1}^d{D_{ij}} \underbrace{\log(P(\vec X_i, t_j))}_{\text{parametrized by }P,Q,\Sigma} \\ &\text{subject to:} \\ &P,Q,\Sigma \geq 0 \\ &\text{Entries in each column of } P \text{ sum to 1} \\ &\text{Entries in each column of } Q \text{ sum to 1} \\ &\Sigma \text{ is a diagonal matrix whose entries sum to 1} \end{align} \]
Treat data as observations that arise from a generative probabilistic process that includes hidden variables: For documents, the hidden variables reflect the thematic structure of the collection.
Infer the hidden structure using posterior inference: What are the topics that describe this collection?
Situate new data into the estimated model: How does this query or new document fit into the estimated topic structure?
Scalability
Ability to deal with various types of data
Few or no assumptions about input data
Minimal requirements for domain knowledge
Interpretability and usability
Criteria to determine whether the clusters are meaningful
Internal validation
External validation
Coherence
Inter-cluster similarity vs. intra-cluster similarity
Davies–Bouldin index
\(DB = \frac{1}{k}\sum_{i=1}^k{\underset{j \neq i}{\operatorname{max}}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}\) ← Evaluate every pair of clusters
We prefer smaller DB-index!
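A sketch of the DB-index computation (pure Python; Euclidean distance, with \(\sigma_i\) taken as the mean distance of cluster \(i\)'s points to its centroid):

```python
import math

def davies_bouldin(clusters):
    """clusters: list of lists of points (tuples of coordinates)."""
    centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) for cl in clusters]
    # sigma_i: average distance of members to their own centroid (scatter)
    sigmas = [sum(math.dist(p, c) for p in cl) / len(cl)
              for cl, c in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # for each cluster, take the worst (largest) ratio against any other
        total += max((sigmas[i] + sigmas[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

Tight clusters far apart give a small index; spread-out clusters close together give a large one, which is why smaller is better.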
Given class label \(\Omega\) on each instance (required; may involve extra annotation cost)
Purity: correctly clustered documents in each cluster
\[purity(\Omega, C) = \frac{1}{N}\sum_k{\max_j|\omega_k \cap c_j|} = \frac{1}{17}(5 + 4 + 3) \approx 0.71\]
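Counting each cluster's majority class is a one-liner with `collections.Counter` (a sketch; the id-to-label dictionary is an illustrative input format):

```python
from collections import Counter

def purity(clusters, labels):
    """clusters: list of lists of item ids; labels: dict id -> gold class.
    Purity = fraction of items assigned to their cluster's majority class."""
    n = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / n
```

Note the bias mentioned below: with every item in its own singleton cluster, every cluster's majority class is trivially correct and purity is 1.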
Given class label \(\Omega\) on each instance
Normalized mutual information (NMI)
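NMI normalizes the mutual information between clusters and classes by the average of their entropies, \(NMI(\Omega, C) = I(\Omega; C)\,/\,\frac{1}{2}[H(\Omega)+H(C)]\). A pure-Python sketch:

```python
import math
from collections import Counter

def nmi(cluster_ids, class_ids):
    """NMI = I(Omega; C) / ((H(Omega) + H(C)) / 2), in [0, 1]."""
    n = len(cluster_ids)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    joint = Counter(zip(cluster_ids, class_ids))
    pw, pc = Counter(cluster_ids), Counter(class_ids)
    # I(Omega; C) = sum over nonzero joint cells of p(w,c) log(p(w,c)/(p(w)p(c)))
    mi = sum((nij / n) * math.log(n * nij / (pw[w] * pc[c]))
             for (w, c), nij in joint.items())
    return mi / ((entropy(cluster_ids) + entropy(class_ids)) / 2)
```

Unlike purity, the normalization penalizes over-fragmented clusterings, since \(H(\Omega)\) grows with the number of clusters.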
Given class label \(\Omega\) on each instance
Rand index
Idea: we want to assign two documents to the same cluster if and only if they are from the same class
\(RI = \frac{TP + TN}{TP + FP + FN + TN}\) ← Essentially it is like classification accuracy
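A sketch of the pair-counting view (pure Python): every pair of documents is a "decision", and the Rand index is the fraction of decisions on which the clustering agrees with the gold classes.

```python
from itertools import combinations

def rand_index(cluster_ids, class_ids):
    """Rand index over all document pairs: (TP + TN) / all pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_class = class_ids[i] == class_ids[j]
        if same_cluster and same_class:
            tp += 1          # correctly placed in the same cluster
        elif same_cluster:
            fp += 1          # same cluster but different classes
        elif same_class:
            fn += 1          # same class but split across clusters
        else:
            tn += 1          # correctly kept apart
    return (tp + tn) / (tp + fp + fn + tn)
```

The same four counts feed the pairwise precision \(TP/(TP+FP)\), recall \(TP/(TP+FN)\), and F-measure mentioned next.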
Given class label \(\Omega\) on each instance
Precision/Recall/F-measure
\[sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|(|c_i \cup c_j| - 1)}\sum_{\vec x \in (c_i \cup c_j)} \sum_{\vec y \in (c_i \cup c_j): \vec y \neq \vec x}{sim(\vec x, \vec y)}\]
Compromise between single and complete link.
Two options:
Averaged across all ordered pairs in the merged cluster
Averaged over all pairs between the two original clusters
No clear difference in efficacy
\[\vec s(c_j) = \sum_{\vec x \in c_j}{\vec x}\] - Compute similarity of clusters in constant time:
\[sim(c_i, c_j) = \frac{(\vec s(c_i) + \vec s(c_j)) \cdot (\vec s(c_i) + \vec s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}\]
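The sum-vector trick can be checked against the brute-force average. A sketch (pure Python, assuming dot-product similarity and unit-normalized vectors, so each self-similarity \(\vec x \cdot \vec x = 1\)):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_sum(cl):
    """Sum vector of a cluster's member vectors."""
    return [sum(xs) for xs in zip(*cl)]

def group_average_fast(ci, cj):
    """Constant-time group-average similarity from precomputed sum vectors:
    (s.s - n) / (n(n-1)), where s = s(ci) + s(cj) and n = |ci| + |cj|.
    Subtracting n removes the n self-similarity terms in s.s."""
    s = [a + b for a, b in zip(cluster_sum(ci), cluster_sum(cj))]
    n = len(ci) + len(cj)
    return (dot(s, s) - n) / (n * (n - 1))

def group_average_slow(ci, cj):
    """Direct average of similarity over all ordered pairs of distinct vectors."""
    merged = ci + cj
    n = len(merged)
    total = sum(dot(merged[a], merged[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))
```

Expanding \(\vec s \cdot \vec s\) gives all \(n^2\) pairwise dot products; removing the \(n\) self-pairs and dividing by \(n(n-1)\) recovers exactly the brute-force average.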
Internal criterion: A good clustering will produce high quality clusters in which:
the intra-class (that is, intra-cluster) similarity is high
the inter-class similarity is low
The measured quality of a clustering depends on both the document representation and the similarity measure used
Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold standard data
Assesses a clustering with respect to ground truth … requires labeled data
Assume documents with \(C\) gold standard classes, while our clustering algorithms produce \(K\) clusters, \(\omega_1\), \(\omega_2\), …, \(\omega_K\), where cluster \(\omega_i\) has \(n_i\) members.
\[Purity(\omega_i) = \frac{1}{n_i}\max_j(n_{ij}) \ \ \ j \in C\]
Biased, because splitting the data into \(n\) singleton clusters maximizes purity
Others are entropy of classes in clusters (or mutual information between classes and clusters)
Text clustering
In clustering, clusters are inferred from the data without human input (unsupervised learning)
However, in practice, it’s a bit less clear: there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents
Evaluation